FIGURE 4.2
(a) A cell containing four intermediate nodes B1, B2, B3, B4 that apply sampled operations to the input node B0, which comes from the output of the previous cell. The output node concatenates the outputs of the four intermediate nodes. (b) A Gabor filter. (c) A generic denoising block. Following [253], it wraps the denoising operation with a 1 × 1 convolution and an identity skip connection [84].
we progressively abandon the worst-performing operation and, for each edge, sample the operations that have a low expected performance but a significant variance. Unlike [291], which uses performance as the evaluation metric to decide which operation should be pruned, we use the anti-bandit algorithm described in Section 4.2.1 to make this decision.
Following UCB in the bandit algorithm, we first obtain an initial performance for each operation on every edge. Specifically, we sample one of the $K$ operations in $\Omega^{(i,j)}$ for every edge, obtain the validation accuracy $a$ by adversarially training the sampled network for one epoch, and finally assign this accuracy as the initial performance $m_{k,0}^{(i,j)}$ of all the sampled operations.
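As a rough illustration of this warm-up step, consider the NumPy sketch below. The edge count `E`, the operation count `K`, and the `train_and_validate` placeholder are hypothetical stand-ins for the actual supernet and its adversarial training loop, and visiting each operation once per edge is just one plausible reading of the initialization, not the confirmed procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
E, K = 14, 8          # hypothetical: number of edges and candidate operations

# m[e, k]: running performance of the k-th operation on edge e (Eq. 4.11)
# n[e, k]: how many times that operation has been sampled
m = np.zeros((E, K))
n = np.zeros((E, K))

def train_and_validate(sampled_ops):
    """Placeholder for one epoch of adversarial training of the sampled
    subnetwork followed by evaluation; returns the validation accuracy a."""
    return rng.uniform(0.1, 0.9)

# Warm-up: visit each of the K operations once on every edge (one plausible
# reading of the initialization), assigning the validation accuracy a of the
# sampled network to all operations sampled in that round.
for k in range(K):
    sampled = np.full(E, k)            # sample operation k on every edge
    a = train_and_validate(sampled)
    m[np.arange(E), sampled] = a
    n[np.arange(E), sampled] += 1
```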
By considering the confidence of the $k$th operation using Eq. 4.8, the LCB is calculated by
$$ s_L(o_k^{(i,j)}) = m_{k,t}^{(i,j)} - \sqrt{\frac{2\log N}{n_{k,t}^{(i,j)}}}, \qquad (4.9) $$
where $N$ is the total number of samples, $n_{k,t}^{(i,j)}$ denotes the number of times the $k$th operation of edge $(i, j)$ has been selected, and $t$ is the epoch index. The first term in Eq. 4.9 is the value term (see Eq. 4.2), which favors operations that look good historically; the second is the exploration term (see Eq. 4.3), which gives operations an exploration bonus that grows with $\log N$. The selection probability for each operation is defined as
$$ p(o_k^{(i,j)}) = \frac{\exp\{-s_L(o_k^{(i,j)})\}}{\sum_m \exp\{-s_L(o_m^{(i,j)})\}}. \qquad (4.10) $$
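Eqs. 4.9 and 4.10 can be sketched as follows, continuing the NumPy setup above; the `eps` guard and the stability shift inside the softmax are implementation details assumed here, not specified in the text:

```python
def lcb_scores(m, n, eps=1e-8):
    """Lower confidence bound of Eq. 4.9: value term minus exploration term.
    eps guards against division by zero for never-sampled operations
    (an assumed implementation detail)."""
    N = n.sum()                                  # total number of samples
    return m - np.sqrt(2.0 * np.log(N) / (n + eps))

def sampling_probs(s_l):
    """Row-wise softmax over negative LCB scores (Eq. 4.10): operations with
    a smaller confidence bound get a larger sampling probability."""
    z = np.exp(-(s_l - s_l.min(axis=1, keepdims=True)))  # shift for stability
    return z / z.sum(axis=1, keepdims=True)
```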
The minus sign in Eq. 4.10 means that we prefer to sample operations with a smaller confidence. After sampling one operation for every edge based on $p(o_k^{(i,j)})$, we obtain the validation accuracy $a$ by adversarially training the sampled network for one epoch, and then update the performance $m_{k,t}^{(i,j)}$, which historically indicates the validation accuracy of all the sampled operations $o_k^{(i,j)}$, as
$$ m_{k,t}^{(i,j)} = (1-\lambda)\, m_{k,t-1}^{(i,j)} + \lambda \cdot a, \qquad (4.11) $$
where $\lambda$ is a hyperparameter.
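Putting Eqs. 4.9-4.11 together, one search epoch might look like the following sketch, reusing the arrays and helpers from the snippets above; the value of λ and the number of epochs are assumed for illustration only:

```python
lam = 0.7   # assumed value for the smoothing hyperparameter lambda

for t in range(1, 50):                        # search epochs
    p = sampling_probs(lcb_scores(m, n))
    # Sample one operation per edge according to Eq. 4.10.
    sampled = np.array([rng.choice(K, p=p[e]) for e in range(E)])
    a = train_and_validate(sampled)           # one epoch, adversarial training
    idx = (np.arange(E), sampled)
    m[idx] = (1.0 - lam) * m[idx] + lam * a   # Eq. 4.11: moving-average update
    n[idx] += 1
    # The progressive abandoning of the worst-performing operation on each
    # edge (via the anti-bandit criterion of Section 4.2.1) is omitted here.
```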